• Critically assess the basic principles of different statistical techniques.
• Be able to select the correct statistical test depending on the experimental design and data type.
• Use R syntax and the R ecosystem to perform data analysis tasks.
1. Descriptive Statistics
2. Inferential Statistics
3. Hypothesis Testing
▪️ A Sample or Descriptive Statistic is a number that summarises data.
▪️ Some of the most common sample statistics are the mean, the standard deviation, the median, the maximum, and the minimum.
The most popular measure of central tendency is the mean, also known as the simple average.
\(\LARGE\bar{x}=\frac{x_{1} + x_{2} + x_{3} + \dots + x_{n}}{n}\)
What is the downside of using the mean as a measure of central tendency? It is heavily affected by outliers.
A better measure of Central Tendency is the Median, which represents the middle number in an ordered dataset and is NOT affected by outliers.
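To see this in R, here is a quick sketch with a made-up vector containing one extreme value:

```r
x <- c(1, 2, 3, 4, 100)  # hypothetical data with one outlier (100)

mean(x)    # 22 -- pulled upwards by the outlier
median(x)  # 3  -- unaffected by the outlier
```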
A measure of Asymmetry in a dataset is
Skewness.
Skewness indicates if the observations in a dataset are concentrated (skewed) on one side.
The file event_times.txt contains the times (s) at which consecutive cell divisions occur in a cell line culture. We want to examine the distribution of waiting times between successive cell divisions.
We can use the summary() function to get the main
descriptive statistics for this dataset:
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00634 0.19084 0.33132 0.54319 0.69994 3.10818
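A minimal R sketch of this workflow; the `scan()` call assumes event_times.txt stores one time per line, and the simulated vector here stands in for the real data:

```r
# Stand-in for the real data; with the actual file you would use:
#   event_times <- scan("event_times.txt")
set.seed(1)
event_times <- cumsum(rexp(200, rate = 2))  # simulated division times (s)

# Waiting times between successive divisions
waiting_times <- diff(event_times)

summary(waiting_times)  # Min., quartiles, Mean, Max.
```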
Take note of the Mean and Median values.
▪️ If Mean > Median the data have
a positive or right skew.
▪️ If Mean < Median the data have
a negative or left skew.
▪️ If Mean = Median the data are
completely symmetrical.
What type of distribution/skew do we have in this case?
Univariate Measures of Dispersion:
▪️ Sample variance \(s^2 = \frac{\sum_{i=1}^{n}(x_{i}-\bar{x})^2}{n-1}\) measures the dispersion of a set of data points around their mean value.
▪️ The variance formula for the sample is more conservative than the population formula.
▪️ The (n-1) term in the formula (Bessel's correction) accounts for the fact that the variance captured by a sample tends to underestimate the variance of the population.
▪️ Variance calculations can often result in large
values as the term \((x_{i}-\bar{x})^2\) is squared.
▪️ Solution: use the square root of the variance, the standard deviation \(s\), instead.
▪️ The Coefficient of Variation (CV) or Relative Standard Deviation (RSD) is used to compare the standard deviations of data recorded in different units, e.g. Kg vs g.
▪️ The RSD or CV can also be expressed as a percentage.
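In R these dispersion measures are one-liners; the weight data below are hypothetical:

```r
weights <- c(61.5, 64.2, 70.8, 75.3, 79.9, 83.1)  # hypothetical weights (kg)

var(weights)   # sample variance, uses the (n - 1) denominator
sd(weights)    # standard deviation = sqrt(var(weights))

cv <- sd(weights) / mean(weights)  # coefficient of variation (unit-free)
cv * 100                           # RSD/CV expressed as a percentage
```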
Inferential Statistics refers to the branch of Statistics that relies on Probability Theory and Distributions to predict population values based on sample data.
Let’s go back to the example of trying to calculate the mean diastolic blood pressure (MDBP) for the adult population of Massachusetts. In an effort to standardise our experiment, we have collected three samples of 20 volunteers each and the mean values are: 75.2, 79.5, 80.1 mm Hg.
This difference between sample means is called Sampling Error or Sampling Variability.
The best way to reduce the sampling error is by increasing the sample size.
We can use a simple simulation example to see what
happens when we draw repeated samples of equal size from the same
population.
This means any variation observed will be due to
sampling error.
▪ Define a normally distributed population with mean value MDBP = 78 mm Hg and sd = 6.
▪ Draw 10,000 samples of size 20 from the above
population.
▪ Examples of sample means
## [1] 77.98933 78.66990 77.30639 77.07685 79.63759 77.48844
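The simulation itself fits in a few lines of R; the seed is arbitrary, so your sample means will differ from those shown above:

```r
set.seed(42)  # arbitrary seed for reproducibility

# 10,000 samples of size 20 from the N(78, 6) population
sample_means <- replicate(10000, mean(rnorm(20, mean = 78, sd = 6)))

head(sample_means)   # a few of the 10,000 sample means
hist(sample_means)   # the sampling distribution of the mean
```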
▪ A sampling distribution shows the expected range
and frequency of outcomes when we repeat the same sampling
process.
▪️ An alternative way of thinking of distributions is in terms of how likely an outcome is to occur, instead of how often it occurs. Probability distributions convert the frequency of an outcome into the probability of observing that outcome.
The Normal distribution is completely symmetrical, with the most probable values centred around the mean.
The Standard Normal distribution is a special normal distribution with mean = 0 and sd = 1: N(0,1).
If we have approximately normally distributed data, we can apply z-score standardisation to transform the dataset into one with a standard normal distribution.
Once we have acquired the z-scores we can compare
them against probability tables for the probability of getting this
score.
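In R, `pnorm()` plays the role of the probability table. A sketch using the MDBP population from earlier; the observed value 87 is made up for illustration:

```r
mu    <- 78   # population mean (MDBP example)
sigma <- 6    # population sd
x     <- 87   # hypothetical observed value

z <- (x - mu) / sigma   # z-score standardisation; here z = 1.5

1 - pnorm(z)   # P(Z >= 1.5): probability of a value this large or larger
```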
A distribution is normal if:
The Central Limit Theorem shows the following:
When we increase the sample size (or the number of samples), then the sample mean will be closer to the population mean (Law of Large Numbers).
“If we have a sample with more than 30 observations, we can accept that it is coming from a sampling distribution with a mean equal to the population mean”.
▪️ With multiple large samples, the sampling
distribution of the mean is normally distributed, even if the
original variable is not.
▪️ We can use parametric tests for large samples from
populations with any kind of distribution as long as other important
assumptions are met.
▪️ For small samples, the assumption of normality is important because the sampling distribution of the mean isn’t known.
The standard error of the mean is: \[SE = σ/\sqrt{n}\]
“The standard error of an estimate is the standard deviation of the estimate’s sampling distribution”.
❕ The key point to remember is that the standard error (SE or se) is a measure of the spread, or dispersion, of the sampling distribution.
Standard Deviation tells us how far each value lies from the mean within a single dataset (a descriptive statistic).
Standard Error tells us how accurately our sample data represents the whole population (An inferential statistic).
▪️ Another way of estimating how well the sample
describes the population is by calculating confidence
intervals.
▪️ For a given α, the margin of error is \(m = Z_{α/2} \cdot SE\), giving the CI: \(\bar{x} ± m\).
▪️ Confidence Intervals are a range of values where
the population mean is likely to fall.
| Desired CI | Z Score |
|---|---|
| 90% | 1.645 |
| 95% | 1.96 |
| 99% | 2.576 |
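The Z scores in the table come straight from `qnorm()`, and a CI follows from the formula above; the sample here is simulated for illustration:

```r
qnorm(c(0.95, 0.975, 0.995))   # 1.645, 1.960, 2.576 -- the table values

# 95% CI for a hypothetical sample of 20 MDBP readings
set.seed(1)
x  <- rnorm(20, mean = 78, sd = 6)
se <- sd(x) / sqrt(length(x))           # standard error of the mean

mean(x) + c(-1, 1) * qnorm(0.975) * se  # lower and upper CI bounds
```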
So what is a hypothesis? 🤔
Intuitively “A
hypothesis is a statement that can be tested”.
Example: The mean length of newborn babies in the UK
is equal to 50cm.
A hypothesis can be TRUE or FALSE. The two scenarios are covered by the Alternative and Null Hypothesis respectively.
The Null hypothesis (\(H_{0}\)) says what our theory predicts will be FALSE.
The Alternative hypothesis (\(H_{1}\)) says what our theory predicts will be TRUE.
When conducting hypothesis testing the alternative hypothesis can be two sided or one sided.
Remember that the alternative hypothesis \(H_{1}\) cannot be proved.
What we
are trying to do is reject the Null hypothesis \(H_{0}\).
According to the National Institutes of Health in the U.S., an estimated 31.9% of U.S. adolescents aged 13-18 had any anxiety disorder 😟 in the period 2001-2003.
We hypothesise that in recent years the
prevalence of anxiety disorders in adolescents in the U.S. has
risen.
In hypothesis testing we have three things we need to
define:
A) The Null Hypothesis \(H_{0}\) we are trying to
reject.
B) The rejection region.
C) The significance level.
After defining the Null Hypothesis we need to define the Rejection
Region.
How is the rejection region defined?
Assume we are interested in testing the following statement: “The average mean birth weight of babies born 👶 in a large UK hospital🏥 is 3900 g”.
We don’t agree with this statement and we declare that: “The average birth weight of newborn babies in this hospital is different to 3900 g”.
\(H_{0}\): birth
weight = 3900 g
\(H_{1}\): birth
weight ≠ 3900 g
After obtaining the birth records for all babies born in this hospital in the last year, the mean weight was 3460 g with a sd of 495 g, and the data were normally distributed: N(μ = 3460, σ = 495).
Rejection region at significance level α = 0.05.
From the Figure below, can we can reject the Null hypothesis?
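As a sketch, the boundaries of the two-sided rejection region on the standard normal scale can be obtained with `qnorm()`:

```r
alpha <- 0.05

# Reject H0 if the test statistic falls outside these critical values
qnorm(c(alpha / 2, 1 - alpha / 2))   # approximately -1.96 and 1.96
```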
The probability of making a Type I error is α.
Type I errors are more serious and tests are usually designed to reduce the probability of type I errors (e.g. Post-Hoc tests).
A test of significance finds the probability of getting an outcome as extreme or more extreme than the actually observed outcome assuming the Null hypothesis is TRUE.
▪️ We can use z scores to assess how far away the estimate is from the population parameter.
▪️ We call these scores a test statistic which has the purpose of measuring compatibility between the Null hypothesis and the data.
\(z=\frac{estimate - hypothesised\;value}{standard\;deviation\;of\;the\;estimate}\)
▪️ estimate = the observed value for a statistic acquired from the sample.
▪️ hypothesised value = the value we attribute to the parameter under the Null hypothesis.
▪️ standard deviation of the estimate = the sd of the sampling distribution.
Now assume we want to test whether there is a difference in birth weight between boys 👦 and girls 👧 in the country.
What does our hypothesis look like?
To test the hypothesis we look at many different samples and find that boys are on average 200 g heavier than girls, with sd = 60 g. Is this difference statistically significant?
The z statistic in this case would be: \(z= \frac{200 - 0}{60} = 3.33\)
In our example this translates as:
P(Z ≤ -3.33 or Z ≥ 3.33) = P(|Z| ≥ 3.33) = 2P(Z ≥ 3.33)
From the table of z scores we find:
2P(Z≥ 3.33) = 2(1-0.9996) = 0.0008.
This is the P-value of the test.
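R gives the same P-value directly via `pnorm()`, with no table lookup:

```r
z <- 3.33

2 * (1 - pnorm(z))   # two-sided P-value, ~0.00087 (the table rounds to 0.0008)
```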
The P-value of a test is the probability that the
test statistic would take a value as extreme or more extreme than that
actually observed assuming that \(H_{0}\) is true.
If the P-value is ≤ α, we say that the data are statistically significant at level α.
The new statistic is called: one-sample t-statistic:
\[{t = \frac{\bar{x} - μ}{s /
\sqrt{n}}}\]
The denominator, \(s/\sqrt{n}\), is called the standard error of the sample mean; it estimates the unknown standard deviation of the sampling distribution of the mean: \[σ / \sqrt{n}\]
The type of t-distribution for a given sample is dependent on the sample size (n)!
For a random variable T having the t(n-1) distribution, the P-value for a test of \(H_{0}\) against all possible alternatives is calculated as:
A) For \(H_{1}: μ > μ_{0}\) the P-value is: P(T ≥ t)
B) For \(H_{1}: μ < μ_{0}\) the P-value is: P(T ≤ t)
C) For \(H_{1}: μ ≠ μ_{0}\) the P-value is: 2P(T ≥ |t|)
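R's `t.test()` computes the one-sample t-statistic and the matching P-value for any of the three alternatives; the newborn-length data here are simulated for illustration:

```r
set.seed(2)
lengths <- rnorm(25, mean = 51, sd = 2)  # hypothetical newborn lengths (cm)

t.test(lengths, mu = 50)                           # C) two-sided H1: mu != 50
t.test(lengths, mu = 50, alternative = "greater")  # A) one-sided H1: mu > 50
```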
▪️ In many research studies our purpose is to see if
a treatment has an effect on a population.
▪️ For the study
results to be valid we need to include a control group
as well as the treatment group.
▪ This is often called the Two-sample problem.
📝 In addition, the two groups do not need to be the same size, as they would in matched-pair designs.
We have a clinical trial where volunteers are randomly assigned to a group receiving a treatment and a control group receiving a placebo.
▪️ The same variable is measured in both groups but
we call the variable \(x_{1}\) in the
treatment group and \(x_{2}\) in the
placebo group as their distribution may be different.
| Population | Sample Size | Sample Mean | Sample standard deviation |
|---|---|---|---|
| 1 | \(n_{1}\) | \({\bar{x_{1}}}\) | \(s_{1}\) |
| 2 | \(n_{2}\) | \({\bar{x_{2}}}\) | \(s_{2}\) |
\[{t = \frac{(\bar{x_{1}} - \bar{x_{2}}) - {(μ_{1} - μ_{2})}}{\sqrt{\frac{s_{1}^2}{n_{1}}+\frac{s_{2}^2}{n_{2}}}}}\]
📝 To decide whether we can reject the Null
Hypothesis in favour of the Alternative \(H_{1}\): \(μ_{1}\) ≠ \(μ_{2}\), we look at the p-values for the
t(k) distribution which is an approximation for the two-sample
t-statistic distribution.
The degrees of freedom
k are either approximated by software or are the smaller of
\({n_{1} - 1}\) vs \({n_{2} - 1}\).
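`t.test()` also handles the two-sample problem; by default it performs the Welch test, with the software approximating the degrees of freedom k as described above. The birth weights below are simulated:

```r
set.seed(3)
boys  <- rnorm(40, mean = 3600, sd = 500)  # hypothetical birth weights (g)
girls <- rnorm(45, mean = 3400, sd = 500)

# Welch two-sample t-test; R approximates the degrees of freedom k
t.test(boys, girls, alternative = "two.sided")
```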
\[{effect\;size = \frac{mean_{treatm} - mean_{control}}{sd_{control}}}\]
😕 < 0.1 = trivial effect
😐 0.1 - 0.3 = small effect
🙂 0.3 - 0.5 = moderate effect
🎉 > 0.5 = large effect
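The effect-size formula translates directly into R; the treatment and control scores below are simulated for illustration:

```r
set.seed(4)
treatment <- rnorm(30, mean = 105, sd = 10)  # hypothetical treatment scores
control   <- rnorm(30, mean = 100, sd = 10)  # hypothetical control scores

effect_size <- (mean(treatment) - mean(control)) / sd(control)
effect_size   # compare against the trivial/small/moderate/large cut-offs
```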